
    Exploring Scientific Application Performance Using Large Scale Object Storage

    One of the major performance and scalability bottlenecks in large scientific applications is parallel reading from and writing to supercomputer I/O systems. The use of parallel file systems and the POSIX consistency requirements, which all the traditional HPC parallel I/O interfaces adhere to, limit the scalability of scientific applications. Object storage is widely used in cloud computing and is increasingly proposed for HPC workloads to address and improve the current scalability and performance of I/O in scientific applications. While object storage is a promising technology, it is still unclear how scientific applications will use it and what the main performance benefits will be. This work addresses these questions by emulating an object store used by a traditional scientific application and evaluating the potential performance benefits. We show that scientific applications can benefit from the use of object storage at large scale.
    Comment: Preprint submitted to WOPSSS workshop at ISC 201
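    The abstract does not describe the emulation layer itself, so the following minimal Python sketch only illustrates the general pattern it alludes to: replacing shared-file POSIX writes with independent per-rank object puts. The EmulatedObjectStore class, the key naming scheme, and the per-rank loop are assumptions made for the example, not the paper's implementation.
```python
"""Minimal sketch of per-rank object writes (illustrative assumptions only)."""

import numpy as np


class EmulatedObjectStore:
    """In-memory stand-in for an object store with a put/get interface."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data: bytes):
        # Whole-object overwrite; no byte-range locking or POSIX consistency.
        self._objects[key] = data

    def get(self, key) -> bytes:
        return self._objects[key]


def checkpoint(store, step, rank, local_field):
    """Each rank writes its local sub-domain as one independent object."""
    key = f"checkpoint/step={step}/rank={rank}"  # hypothetical key layout
    store.put(key, local_field.tobytes())
    return key


if __name__ == "__main__":
    store = EmulatedObjectStore()
    nranks, n = 4, 1024
    # Simulate the per-rank writes that would normally target a shared POSIX file.
    for rank in range(nranks):
        local = np.full(n, rank, dtype=np.float64)
        checkpoint(store, step=10, rank=rank, local_field=local)
    # Restart: each rank reads back only its own object.
    restored = np.frombuffer(store.get("checkpoint/step=10/rank=2"), dtype=np.float64)
    print(restored[:4])
```
    Because each rank owns its objects outright, no cross-rank consistency protocol is needed, which is the scalability argument the abstract makes for object storage.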

    Characterizing Deep-Learning I/O Workloads in TensorFlow

    The performance of Deep-Learning (DL) computing frameworks relies on the performance of data ingestion and checkpointing. In fact, during training, a considerable number of relatively small files are first loaded and pre-processed on CPUs and then moved to accelerators for computation. In addition, checkpointing and restart operations are carried out so that DL computing frameworks can restart quickly from a checkpoint. Because of this, I/O affects the performance of DL applications. In this work, we characterize the I/O performance and scaling of TensorFlow, an open-source programming framework developed by Google and specifically designed for solving DL problems. To measure TensorFlow I/O performance, we first design a micro-benchmark to measure TensorFlow reads, and then use a TensorFlow mini-application based on AlexNet to measure the performance cost of I/O and checkpointing in TensorFlow. To improve checkpointing performance, we design and implement a burst buffer. We find that increasing the number of threads increases TensorFlow bandwidth by a maximum of 2.3x and 7.8x on our benchmark environments. The use of the TensorFlow prefetcher results in a complete overlap of computation on the accelerator and the input pipeline on the CPU, eliminating the effective cost of I/O on the overall performance. Using a burst buffer to checkpoint to fast, small-capacity storage and asynchronously copy the checkpoints to slower, large-capacity storage resulted in a performance improvement of 2.6x with respect to checkpointing directly to the slower storage on our benchmark environment.
    Comment: Accepted for publication at pdsw-DISCS 201
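    As a hedged illustration of the two techniques named above (not the paper's benchmark code), the sketch below shows a tf.data input pipeline with prefetching and a simple burst-buffer checkpoint pattern. The FAST_DIR/SLOW_DIR paths, the batch size, and the thread-based copy are assumptions for the example.
```python
"""Illustrative sketch: tf.data prefetching plus a burst-buffer checkpoint pattern."""

import os
import shutil
import threading

import tensorflow as tf

FAST_DIR = "/tmp/burst_buffer"   # assumed fast, small-capacity storage (e.g. node-local SSD)
SLOW_DIR = "/tmp/parallel_fs"    # assumed slower, large-capacity storage (e.g. Lustre)


def make_input_pipeline(file_pattern):
    """Parallel reads + prefetching so CPU-side I/O overlaps accelerator compute."""
    ds = tf.data.Dataset.list_files(file_pattern, shuffle=True)
    ds = ds.map(tf.io.read_file, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(32)
    return ds.prefetch(tf.data.AUTOTUNE)  # decouple the input pipeline from the training step


def checkpoint_via_burst_buffer(ckpt, step):
    """Write to fast storage, then drain to slow storage in a background thread."""
    os.makedirs(FAST_DIR, exist_ok=True)
    os.makedirs(SLOW_DIR, exist_ok=True)
    prefix = ckpt.write(os.path.join(FAST_DIR, f"ckpt-{step}"))  # blocking, but fast

    def drain():
        # Copy only this step's checkpoint files to the slower tier.
        for name in os.listdir(FAST_DIR):
            if name.startswith(f"ckpt-{step}"):
                shutil.copy2(os.path.join(FAST_DIR, name), SLOW_DIR)

    t = threading.Thread(target=drain, daemon=True)
    t.start()  # training can continue while the copy proceeds
    return prefix, t


if __name__ == "__main__":
    state = tf.train.Checkpoint(step=tf.Variable(0, dtype=tf.int64))
    _, copier = checkpoint_via_burst_buffer(state, step=100)
    copier.join()
    print(os.listdir(SLOW_DIR))
```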

    Using message logs and resource use data for cluster failure diagnosis

    Failure diagnosis for large compute clusters using only message logs is known to be incomplete. The recent availability of resource use data provides another potentially useful source of data for failure detection and diagnosis. Early work combining message logs and resource use data for failure diagnosis has shown promising results. This paper describes the CRUMEL framework, which implements a new approach to combining rationalized message logs and resource use data for failure diagnosis. CRUMEL identifies patterns of errors and resource use and correlates these patterns by time with system failures. Application of CRUMEL to data from the Ranger supercomputer has yielded improved diagnoses over previous research. CRUMEL has: (i) shown that more events correlated with system failures can only be identified by applying different correlation algorithms, (ii) confirmed six groups of errors, (iii) identified Lustre I/O resource use counters which are correlated with the occurrence of Lustre faults and are potential flags for online detection of failures, (iv) matched the dates of correlated error events and correlated resource use with the dates of compute node hang-ups, and (v) identified two more error groups associated with compute node hang-ups. The pre-processed data will be placed in the public domain in September 2016.
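    CRUMEL's actual correlation algorithms are not given in the abstract; the toy sketch below only illustrates the general idea of correlating rationalized log events with failures by time window. The 30-minute window and the synthetic events and failure times are assumptions for the example.
```python
"""Toy sketch of time-windowed correlation between log events and failures."""

import pandas as pd

WINDOW = pd.Timedelta("30min")   # assumed look-back window before each failure

# Rationalized message-log events: (timestamp, error-group label)
events = pd.DataFrame({
    "time": pd.to_datetime(["2013-05-01 10:05", "2013-05-01 10:20",
                            "2013-05-01 13:00", "2013-05-02 08:55"]),
    "group": ["lustre_io_error", "oom", "lustre_io_error", "lustre_io_error"],
})

# Recorded system failures (e.g. compute-node hang-ups)
failures = pd.to_datetime(["2013-05-01 10:30", "2013-05-02 09:10"])

# For each error group, compute the fraction of failures preceded by that group
# within the look-back window.
scores = {}
for group, grp_events in events.groupby("group"):
    hits = sum(((grp_events["time"] > f - WINDOW) & (grp_events["time"] <= f)).any()
               for f in failures)
    scores[group] = hits / len(failures)

print(scores)
```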

    Pitfalls in machine learning‐based assessment of tumor‐infiltrating lymphocytes in breast cancer: a report of the international immuno‐oncology biomarker working group

    The clinical significance of the tumor-immune interaction in breast cancer (BC) has been well established, and tumor-infiltrating lymphocytes (TILs) have emerged as a predictive and prognostic biomarker for patients with triple-negative (estrogen receptor, progesterone receptor, and HER2 negative) breast cancer (TNBC) and HER2-positive breast cancer. How computational assessment of TILs can complement manual TIL assessment in trial and daily practice is currently debated and still unclear. Recent efforts to use machine learning (ML) for the automated evaluation of TILs show promising results. We review state-of-the-art approaches and identify pitfalls and challenges by studying the root cause of ML discordances in comparison to manual TILs quantification. We categorize our findings into four main topics: (i) technical slide issues, (ii) ML and image analysis aspects, (iii) data challenges, and (iv) validation issues. The main reason for discordant assessments is the inclusion of false-positive areas or cells, arising from poor performance on certain tissue patterns or from design choices in the computational implementation. To aid the adoption of ML in TILs assessment, we provide an in-depth discussion of ML and image analysis, including validation issues that need to be considered before reliable computational reporting of TILs can be incorporated into trial and routine clinical management of patients with TNBC.
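    As a purely illustrative aside (not part of the paper), the sketch below shows one simple way discordance between manual and ML-derived stromal TIL percentages could be quantified so that outlier cases can be reviewed against the pitfall categories above. The scores and the 15-percentage-point review threshold are invented for the example.
```python
"""Illustrative sketch: quantifying manual-vs-ML TIL score discordance."""

import numpy as np

manual_tils = np.array([5, 10, 40, 60, 80], dtype=float)   # hypothetical pathologist scores (%)
ml_tils = np.array([8, 35, 42, 55, 30], dtype=float)       # hypothetical ML scores (%)

# Simple agreement statistics.
pearson_r = np.corrcoef(manual_tils, ml_tils)[0, 1]
mean_abs_diff = np.mean(np.abs(manual_tils - ml_tils))

# Flag cases whose disagreement exceeds the (assumed) review threshold.
threshold = 15.0
discordant = np.where(np.abs(manual_tils - ml_tils) > threshold)[0]

print(f"Pearson r = {pearson_r:.2f}, mean |diff| = {mean_abs_diff:.1f} pp")
print(f"Cases flagged for root-cause review: {discordant.tolist()}")
```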

    Heterogeneous High Performance Computing

    Modern HPC systems are becoming increasingly heterogeneous, and this affects all components of HPC systems, from the processing units through memory hierarchies and network components to storage systems. This trend is on the one hand due to the need to build larger, yet more energy-efficient systems, and on the other hand caused by the need to optimise (parts of the) systems for certain workloads. In fact, it is not only the systems themselves that are becoming more heterogeneous, but scientific and industrial applications are also increasingly combining different technologies into complex workflows, including simulation, data analytics, visualisation, and artificial intelligence/machine learning. Different steps in these workflows call for different hardware, and thus today’s HPC systems are often composed of different modules optimised to suit certain stages of these workflows. While the trend towards heterogeneity is certainly helpful in many respects, it makes the task of programming these systems and using them efficiently much more complicated. Often, a combination of different programming models is required, and selecting suitable technologies for certain tasks, or even parts of an algorithm, is difficult. Novel methods might be needed for heterogeneous components or may only be enabled by them. This trend is continuing, with new technologies around the corner that will further increase heterogeneity, e.g. neuromorphic or quantum accelerators, in-memory computing, and other non-von-Neumann approaches. In this paper, we present an overview of the different levels of heterogeneity found in HPC technologies and provide recommendations for research directions to help deal with the challenges they pose. We also point out opportunities from which applications in particular can profit by exploiting these technologies. Research efforts will be needed across the full spectrum, from system architecture, compilers, and programming models/languages to runtime systems, algorithms, and novel mathematical approaches.

    Linking resource usage anomalies with system failures from cluster log data

    Bursts of abnormally high resource use are thought to be an indirect cause of failures in large cluster systems, but little work has systematically investigated the role of high resource usage in system failures, largely due to the lack of a comprehensive resource monitoring tool which resolves resource use by job and node. The recently developed TACC_Stats resource use monitor provides the required resource use data. This paper presents the ANCOR diagnostics system, which applies TACC_Stats data to identify resource use anomalies and applies log analysis to link those anomalies with system failures. Applying ANCOR to first identify multiple sources of resource anomalies on the Ranger supercomputer, then correlate them with failures recorded in the message logs, and finally diagnose the cause of the failures has identified four new causes of compute node soft lockups. ANCOR can be adapted to any system that uses a resource use monitor which resolves resource use by job.
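    The abstract does not detail ANCOR's anomaly-detection rules, so the toy sketch below shows one generic way such a link could be made: flag resource-use outliers with a robust z-score over TACC_Stats-style counters, then intersect the flagged nodes with nodes named in log-derived failure records. The counter values, node names, and the 3.5 threshold are assumptions for the example.
```python
"""Toy sketch (not ANCOR): robust-z anomaly flagging linked to failed nodes."""

import numpy as np

# Per-node samples of one resource-use counter (e.g. memory bandwidth or Lustre RPCs).
nodes = ["c401-101", "c401-102", "c401-103", "c401-104"]
counter = np.array([
    [1.0, 1.1, 0.9, 1.2],     # c401-101
    [1.1, 1.0, 1.2, 0.9],     # c401-102
    [1.0, 9.5, 10.2, 9.8],    # c401-103  <- abnormally high burst
    [0.9, 1.2, 1.1, 1.0],     # c401-104
])

# Robust z-score (median/MAD) so the burst itself does not mask the anomaly.
med = np.median(counter)
mad = np.median(np.abs(counter - med))
robust_z = 0.6745 * (counter - med) / mad
anomalous_nodes = {nodes[i] for i, row in enumerate(robust_z) if (row > 3.5).any()}

# Nodes that later appear in failure events extracted from the message logs.
failed_nodes = {"c401-103"}   # e.g. soft-lockup reports

# Link: anomalies on nodes that subsequently failed are candidate causes.
print("Anomalous and failed:", anomalous_nodes & failed_nodes)
```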